# Install and load packages
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
devtools::install_github("ebenmichael/augsynth")
## Using GitHub PAT from the git credential store.
## Skipping install of 'augsynth' from a github remote, the SHA1 (982f650b) has not changed since last install.
## Use `force = TRUE` to force installation
pacman::p_load(# Tidyverse packages including dplyr and ggplot2
tidyverse,
ggthemes,
augsynth,
gsynth)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# set seed
set.seed(44)
# load data
medicaid_expansion <- read_csv("data/medicaid_expansion.csv")
## Rows: 663 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): State
## dbl (3): year, uninsured_rate, population
## date (1): Date_Adopted
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(medicaid_expansion)
## [1] "State" "Date_Adopted" "year" "uninsured_rate"
## [5] "population"
For this project, you will explore the question of whether the Affordable Care Act increased health insurance coverage (or conversely, decreased the number of people who are uninsured). The ACA was passed in March 2010, but several of its provisions were phased in over a few years. The ACA instituted the “individual mandate” which required that all Americans must carry health insurance, or else suffer a tax penalty. There are four mechanisms for how the ACA aims to reduce the uninsured population:
In 2012, the Supreme Court heard the landmark case NFIB v. Sebelius, which principally challenged the constitutionality of the law under the theory that Congress could not institute an individual mandate. The Supreme Court ultimately upheld the individual mandate under Congress’s taxation power, but struck down the requirement that states must expand Medicaid as impermissible subordination of the states to the federal government. Subsequently, several states refused to expand Medicaid when the program began on January 1, 2014. This refusal created the “Medicaid coverage gap” where there are indivudals who earn too much to qualify for Medicaid under the old standards, but too little to qualify for the ACA subsidies targeted at middle-income individuals.
States that refused to expand Medicaid principally cited the cost as the primary factor. Critics pointed out however, that the decision not to expand primarily broke down along partisan lines. In the years since the initial expansion, several states have opted into the program, either because of a change in the governing party, or because voters directly approved expansion via a ballot initiative.
You will explore the question of whether Medicaid expansion reduced the uninsured population in the U.S. in the 7 years since it went into effect. To address this question, you will use difference-in-differences estimation, and synthetic control.
The dataset you will work with has been assembled from a few different sources about Medicaid. The key variables are:
Create plots and provide 1-2 sentence analyses to answer the following questions:
# Highest and lowest uninsured rates
#Find years in dataset
unique(medicaid_expansion$year)
## [1] 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
# Filter data for all years prior to 2014
pre2014_data <- medicaid_expansion %>%
filter(year < 2014)
# Uninsured rates prior to 2014 (Going to take the mean of all years prior, not taking it year by year, will update if this is a mistake)
pre2014_uninsuredrate <- pre2014_data %>%
group_by(State) %>%
summarize(mean_uninsured_rate = mean(uninsured_rate)) %>%
arrange(desc(mean_uninsured_rate)) %>%
mutate(State = factor(State, levels = State))
#Find top and bottom quartiles (calling this the highest and lowest, since I don't want to plot all 50 states)
# Calculate the 25th and 75th percentiles (Quartiles)
q1 <- quantile(pre2014_uninsuredrate$mean_uninsured_rate, 0.25)
q3 <- quantile(pre2014_uninsuredrate$mean_uninsured_rate, 0.75)
# Filter states in the bottom and top quartiles based on mean uninsured rate
bottom_quartile2014 <- pre2014_uninsuredrate %>%
filter(mean_uninsured_rate <= q1) %>%
mutate(Quartile = 'Bottom Quartile')
top_quartile2014 <- pre2014_uninsuredrate %>%
filter(mean_uninsured_rate >= q3) %>%
mutate(Quartile = 'Top Quartile')
# Combine both quartiles into a single data frame
combined_quartiles <- bind_rows(bottom_quartile2014, top_quartile2014)
# Create a plot for both top and bottom quartiles
ggplot(combined_quartiles, aes(x = reorder(State, mean_uninsured_rate), y = mean_uninsured_rate, fill = Quartile)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(title = 'States with the Highest and Lowest Uninsured Rates Prior to 2014',
x = 'State',
y = 'Mean Uninsured Rate (%)') +
scale_fill_manual(values = c('Bottom Quartile' = 'lightgreen', 'Top Quartile' = 'lightcoral')) +
theme_minimal()
I decided to plot uninsured population rates pre-2014 as an average of
those rates over the pre-2014 years in the study period (2008-2013). I
understand that I lose some naunce here because rates could be way
higher/lower at some point in this pre-2014 period leading to a
deceiving average, whereas looking at this yearly would allow me to see
those trends over time pre-2014, but for the sake of exploratory
analyses, I like this visual. I will return to this if it seems like
later on I should be looking year to year. It looks like the states with
the lowest rates of uninsured individuals pre-2014 are mostly
northern/midwestern states, and the highest rates of uninsured
individuals are southern/western states. P.S., it’s noted that D.C. is
missing population data. If that becomes an issue, I’ll take it out of
the dataset.
library(dplyr)
# most uninsured Americans
# Calculate total uninsured population for each state prior to 2014
pre2014_uninsuredrate <- pre2014_data %>%
group_by(State) %>%
summarize(
mean_uninsured_rate = mean(uninsured_rate),
population = first(population) #Since there's only one unique value for each state (because it's based on 2010 population), just choosing the first
) %>%
mutate(
total_uninsured_population = population * mean_uninsured_rate
) %>%
arrange(desc(total_uninsured_population)) %>%
mutate(State = factor(State, levels = State))
#Caluclate total uninsured population for each state in 2020
uninsured_2020 <- medicaid_expansion %>%
filter(year == 2020) %>%
group_by(State) %>%
summarize(
uninsured_rate_2020 = first(uninsured_rate), # Use the uninsured rate directly for 2020
population = first(population) # Use the population for the state
) %>%
mutate(
total_uninsured_population_2020 = population * uninsured_rate_2020
) %>%
arrange(desc(total_uninsured_population_2020))
glimpse(pre2014_uninsuredrate)
## Rows: 51
## Columns: 4
## $ State <fct> California, Texas, Florida, New York, Georg…
## $ mean_uninsured_rate <dbl> 0.1796478, 0.1992837, 0.2068583, 0.1145605,…
## $ population <dbl> 38802500, 26956958, 19893297, 19746227, 100…
## $ total_uninsured_population <dbl> 6970785.1, 5372081.4, 4115094.3, 2262137.6,…
glimpse(uninsured_2020)
## Rows: 51
## Columns: 4
## $ State <chr> "Texas", "California", "Florida", "Geo…
## $ uninsured_rate_2020 <dbl> 0.184, 0.077, 0.132, 0.134, 0.113, 0.0…
## $ population <dbl> 26956958, 38802500, 19893297, 10097343…
## $ total_uninsured_population_2020 <dbl> 4960080.3, 2987792.5, 2625915.2, 13530…
# Filter states in the top quartiles pre-2014 and in 2020 based on total uninsured population rates
library(ggplot2)
library(scales) # for comma format
library (tidyr)
#Define cutoffs
q3_2014 <- quantile(pre2014_uninsuredrate$total_uninsured_population, 0.75, na.rm = TRUE)
q3_2020 <- quantile(uninsured_2020$total_uninsured_population_2020, 0.75, na.rm = TRUE)
#Filter states in the top quartile for both time periods
pre2014_top <- pre2014_uninsuredrate %>%
mutate(Year = "Pre-2014") %>%
mutate(in_top_quartile_2014 = total_uninsured_population >= q3_2014) %>%
select(State, total_uninsured_population_2014 = total_uninsured_population, in_top_quartile_2014)
post2020_top <- uninsured_2020 %>%
mutate(Year = "2020") %>%
mutate(in_top_quartile_2020 = total_uninsured_population_2020 >= q3_2020) %>%
select(State, total_uninsured_population_2020, in_top_quartile_2020)
# Full join on State to preserve all states that were in either top quartile
combined_all_states <- full_join(pre2014_top, post2020_top, by = "State")
# Filter out states that weren't in the top quartile for either time period
combined_all_states <- combined_all_states %>%
filter(in_top_quartile_2014 == TRUE | in_top_quartile_2020 == TRUE)
# Reformat data for plotting
plot_data <- combined_all_states %>%
pivot_longer(cols = starts_with("total_uninsured_population"),
names_to = "Year",
values_to = "total_uninsured_population") %>%
mutate(Year = recode(Year,
"total_uninsured_population_2014" = "Pre-2014",
"total_uninsured_population_2020" = "2020"))
# Create a plot for both top quartiles, adding the data for a state if it was in the top quartile for either time period
ggplot(plot_data, aes(x = reorder(State, -total_uninsured_population, na.rm = TRUE),
y = total_uninsured_population,
fill = Year)) +
geom_bar(stat = "identity", position = "dodge", na.rm = TRUE) +
labs(title = "Top Quartile States (Pre2014 and/or 2020) by Total Uninsured Population",
x = "State",
y = "Total Uninsured Population") +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
It looks like the total population of uninsured people went down from
pre-2014 levels for all states in 2020. California seems to have a huge
difference, Florida and New York too, and then Texas looks like it
didn’t change that much. It’s important to note that the graph includes
states that were in the top quartile of total uninsured individuals for
either pre-2014 or in 2020, so some of these states might not have been
in the top quartile for both time periods.
Do the following:
#Find out when states adopted medicaid expansion, create a column indicating whether or not state ever adopted
medicaid_adoption <- medicaid_expansion %>%
distinct(State, Date_Adopted) %>%
mutate(
Adopted = ifelse(is.na(Date_Adopted), "Never Expanded", "Expanded")
) %>%
arrange(Date_Adopted)
#Merge with dataset
medicaid_expansion$Date_Adopted <- as.Date(medicaid_expansion$Date_Adopted)
medicaid_adoption$Date_Adopted <- as.Date(medicaid_adoption$Date_Adopted)
medicaid_expansion_full <- medicaid_expansion %>%
left_join(medicaid_adoption %>% select(State, Date_Adopted, Adopted), by = "State") %>%
select(-Date_Adopted.x) %>%
rename(Date_Adopted = Date_Adopted.y)
colnames(medicaid_expansion_full)
## [1] "State" "year" "uninsured_rate" "population"
## [5] "Date_Adopted" "Adopted"
#When it says "choose a state that did not" I'm assuming this means choose a state that never adopted the expansion, not a state that adopted the expansion just not between 2014-205 (e.g. 2020)
#Find states that adopted expansion on 1/1/2014 and those who didn't at all, create a new dataframe to work on
compare_states <- medicaid_expansion_full %>%
filter(Date_Adopted == as.Date("2014-01-01") | Adopted == "Never Expanded")
compare_states <- compare_states %>%
left_join(pre2014_uninsuredrate %>%
select(State, total_uninsured_population, mean_uninsured_rate),
by = "State")
I had such a hard time trying to figure out how to show parallel trends lol. As I’ve griped about in class, I feel like parallel trends are pretty unconvincing, it’s pretty easy for me to think something fails the eye test unless the lines are, like, really similar.
#Parallel trends plot
parallel_data <- compare_states %>%
filter(State %in% c("California", "Georgia")) %>%
select(State, year, uninsured_rate)
parallel_data <- parallel_data %>%
mutate(year = as.integer(year))
ggplot(parallel_data, aes(x = year, y = uninsured_rate, color = State)) +
geom_line(size = 1.2) +
geom_point() +
scale_x_continuous(breaks = seq(min(parallel_data$year), 2014, 1),
labels = as.character(seq(min(parallel_data$year), 2014, 1))) +
coord_cartesian(xlim = c(min(parallel_data$year), 2014)) + # truncate the plot at 2014
labs(title = "Pre-2014 Uninsured Rate Trends: California vs. Georgia",
x = "Year",
y = "Uninsured Rate",
color = "State") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#Find two states, one that never adopted the expansion and one that adopted 1/1/2014 that are similar in terms of uninsured_rates, populations, and total_uninsured populations prior to 2014
compare_states %>%
filter(State %in% c("California", "Georgia")) %>%
distinct(State, population, Date_Adopted, mean_uninsured_rate, total_uninsured_population) %>%
print()
## # A tibble: 2 × 5
## State population Date_Adopted mean_uninsured_rate total_uninsured_popul…¹
## <chr> <dbl> <date> <dbl> <dbl>
## 1 California 38802500 2014-01-01 0.180 6970785.
## 2 Georgia 10097343 NA 0.192 1935272.
## # ℹ abbreviated name: ¹total_uninsured_population
This was the best I could do… the parallel trends look visually the best out of any pair I tested (I spent like an hour and a half trying), but I know obviously that there are reasons to believe that Georgia and California are very different. That being said, when weighing the visual parallel trends vs. our real world knowledge of these policy contexts, which is important? For example, I kept trying jurisdictions that seemed similar enough (Kentucky vs. Alabama) but their parallel trends visually just couldn’t be justified to me. Not sure if this was the better move or if I should’ve picked a “worse” trend visually that made more sense in the real world.
# Difference-in-Differences estimation
#Create dataframe
cg <- compare_states %>%
filter(State %in% c("California", "Georgia")) %>%
filter(year %in% c(2013, 2014))
#Pre-treatment difference
pre_diff <- cg %>%
filter(year == 2013) %>%
select(State, uninsured_rate) %>%
pivot_wider(names_from = State, values_from = uninsured_rate) %>%
summarise(diff = California - Georgia)
#post-treatment diference
post_diff <- cg %>%
filter(year == 2014) %>%
select(State, uninsured_rate) %>%
pivot_wider(names_from = State, values_from = uninsured_rate) %>%
summarise(diff = California - Georgia)
#DiD
diff_in_diffs <- post_diff - pre_diff
diff_in_diffs
## diff
## 1 -0.01648
I think this means that California’s uninsured rate went down bt -0.01648 from the pre-treatment year (2013) to the post-treatment year (2014). I will come back and do this with averages over pre and post treatment years if I have time…
Card/Krueger’s original piece utilized the fact that towns on either side of the Delaware river are likely to be quite similar to one another in terms of demographics, economics, etc. Why is that intuition harder to replicate with this data? Like discussed above, the trade-off between choosing two states whose parallel trend lines are more convincing and ones whose policy context seem similar is actually more difficult than one might imagine; as we saw above, places with seemingly similar policy contexts (neighboring states with similar populations and mean rates of uninsured population) often had really wonky parallel trends that were often thrown off by just one year. Maybe it’s not even fair to assume that geographically neighboring or similar (e.g. comparing two midwestern states) states would have similar policy contexts, but it definitely seems more intuitive that Georgia is more similar to Florida than it is to California. But I think that would be the intuition of Card/Krueger that can’t be applied here; two neighboring towns are probably more likely to be similar than two neighboring states, but even that feels like a stretch. Two towns that are split by some geographical feature might have differences - see the construction of highways in Oakland.
What are the strengths and weaknesses of using the parallel trends assumption in difference-in-differences estimates? I think the strength is how powerful the visual aspect of parallel trends is. Even though it is subjective, if you see two lines that basically have the same exact slope,it’s hard to argue with that. My California/Georgia example may not be the best, but if it were a little better visually, the existence of the parallel trends visually will kind of rebut claims that these places are too different to compare. Meaning, if the outcome of interest is trending in the same exact ways in these two places, it seems likely that no matter how different they are, covariates are seemingly impacting them in the same way. I think the weakness for me, as mentioned, is also the visual aspect and the fact that it is not really a mathematical formulation of “parallelness,” like there’s no universal standard, it’s just sort of a visual thing that your reviewers may or may not be convinced by. I guess the upshot of that is anyone who is doing serious diff-in-diff work probably is really good at defending their analytical choices to stave off criticisms for those who are typically unconvinced by parallel trends!
Estimate Synthetic Control
Although several states did not expand Medicaid on January 1, 2014, many did later on. In some cases, a Democratic governor was elected and pushed for a state budget that included the Medicaid expansion, whereas in others voters approved expansion via a ballot initiative. The 2018 election was a watershed moment where several Republican-leaning states elected Democratic governors and approved Medicaid expansion. In cases with a ballot initiative, the state legislature and governor still must implement the results via legislation. For instance, Idaho voters approved a Medicaid expansion in the 2018 election, but it was not implemented in the state budget until late 2019, with enrollment beginning in 2020.
Do the following:
# non-augmented synthetic control
#Find delayed adopters
adopted_post2014 <- medicaid_expansion_full %>%
filter(Date_Adopted > as.Date("2014-12-31")) %>%
distinct(State, Date_Adopted)
view(adopted_post2014)
#I'm not sure if I should be using some theory-informed state for the synthetic control, but my gut is telling me to pick a state that adopted on New Years day of some year. Earlier in the notebook it said don't choose 2014 or 2015 for DiD, so I won't choose either. Therefore, I'll choose 01-01-2016 as my date, which hopefully will still give me enough data post-treatment.
treated_state <- "Montana"
treatment_year <- medicaid_expansion_full %>%
filter(State == treated_state) %>%
pull(Date_Adopted) %>%
unique() %>%
as.Date() %>%
format("%Y") %>%
as.numeric()
# Clean data
medicaid_clean <- medicaid_expansion_full %>%
mutate(
treatment = ifelse(State == treated_state & year >= treatment_year, 1, 0)
)
# Run non-augmented synthetic control
syn <- augsynth(
uninsured_rate ~ treatment, # Outcome variable and treatment indicator
unit = State, # Unit (states)
time = year, # Time variable (year)
data = medicaid_clean, # Data frame
progfunc = "none", # No augmentation
scm = TRUE # Enable synthetic control
)
## One outcome and one treatment time found. Running single_augsynth.
#Save ATT and L2 imbalance
# Extract the ATT and L2 Imbalance from the summary text (I was having issues with getting the ATT out of the summary)
summary_text <- capture.output(summary(syn))
att_line <- summary_text[grep("Average ATT Estimate", summary_text)]
l2_line <- summary_text[grep("L2 Imbalance", summary_text)]
# Extract the numeric values for ATT and L2 Imbalance
non_aug_att <- as.numeric(str_extract(att_line, "[-]?\\d+\\.\\d+"))
non_aug_l2 <- as.numeric(str_extract(l2_line, "[-]?\\d+\\.\\d+"))
# Print out the results
cat("Non-Augmented ATT:", non_aug_att, "\n")
## Non-Augmented ATT: -0.0189
cat("Non-Augmented L2 Imbalance:", non_aug_l2, "\n")
## Non-Augmented L2 Imbalance: 0.009
# plot
plot(syn)
weights_df <- data.frame(syn$weights) %>%
tibble::rownames_to_column('State') %>%
rename(syn_weight = syn.weights)
# Plot weights
library(ggplot2)
library(ggthemes)
ggplot(weights_df) +
geom_col(aes(x = State, y = syn_weight), fill = "steelblue") +
coord_flip() +
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
labs(
title = "Synthetic Control Weights",
x = "State",
y = "Weight"
)
Okay, only a few states are donors (is that the right terminology?) for
our synthetic control.
# view each state's contribution, where weights are greater than 0
# ---------
data.frame(syn$weights) %>%
# processing
# ---------
tibble::rownames_to_column('State') %>%
filter(syn.weights > 0) %>% # filter out weights less than 0
# plot
# ---------
ggplot() +
geom_bar(aes(x = State,
y = syn.weights),
stat = 'identity') +
coord_flip() + # flip to make it more readable
# themes
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
# labels
ggtitle('Synthetic Control Weights') +
xlab('State') +
ylab('Weight')
#Use tidysynth to visualize
library(tidysynth)
medicaid_out <- medicaid_clean %>%
synthetic_control(
outcome = uninsured_rate,
unit = State,
time = year,
i_unit = "Montana", # Treated unit
i_time = 2016, # Intervention time
generate_placebos = TRUE
) %>%
# Use a different approach to generate predictors
generate_predictor(
time_window = 2008:2013,
avg_uninsured = mean(uninsured_rate, na.rm = TRUE) # Use a summary statistic instead
) %>%
generate_weights(optimization_window = 2008:2015) %>%
generate_control()
#Plot trends for Montana vs synthetic Montana
medicaid_out %>% plot_trends()
#Plot the differences (gap between actual and synthetic Montana)
medicaid_out %>% plot_differences()
#Plot placebos (other donor states) vs observed Montana
medicaid_out %>% plot_placebos()
#Plot control weights of each other state
medicaid_out %>% plot_weights()
…uhhh why is Nevada the only donor state basically… also why is it the
avg_unisnured
# augmented synthetic control
aug_syn <- augsynth(
uninsured_rate ~ treatment,
unit = State,
time = year,
data = medicaid_clean,
progfunc = "ridge", # Using ridge regression for augmentation
scm = TRUE
)
## One outcome and one treatment time found. Running single_augsynth.
# Extract the ATT and L2 Imbalance using the same approach as before (same issue getting ATT)
summary_aug_text <- capture.output(summary(aug_syn))
aug_att_line <- summary_aug_text[grep("Average ATT Estimate", summary_aug_text)]
aug_l2_line <- summary_aug_text[grep("L2 Imbalance", summary_aug_text)]
# Extract the numeric values
aug_att <- as.numeric(str_extract(aug_att_line, "[-]?\\d+\\.\\d+"))
aug_l2 <- as.numeric(str_extract(aug_l2_line, "[-]?\\d+\\.\\d+"))
# Print out the results
cat("Augmented ATT:", aug_att, "\n")
## Augmented ATT: -0.0189
cat("Augmented L2 Imbalance:", aug_l2, "\n")
## Augmented L2 Imbalance: 0.009
cat("Non-Augmented L2 Imbalance (for comparison):", non_aug_l2, "\n")
## Non-Augmented L2 Imbalance (for comparison): 0.009
plot(aug_syn)
# barplots of weights
weights_plot <- data.frame(aug_syn$weights) %>%
tibble::rownames_to_column('State') %>%
filter(aug_syn.weights > 0) %>%
ggplot() +
geom_bar(aes(x = reorder(State, aug_syn.weights), y = aug_syn.weights),
stat = 'identity', fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(
title = "Augmented Synthetic Control Weights",
x = "State",
y = "Weight"
)
print(weights_plot)
…why are none of the weights negative
# Create a comparison plot between actual, non-aug synthetic, and ridge Montana
#Kasey I tried so hard to put all three on the same graph using the 6-6 notebook and it just would not work I'm sorry... so here's ridge vs observed
# Create a tidysynth object for the ridge-augmented version
medicaid_out_ridge <- medicaid_clean %>%
synthetic_control(
outcome = uninsured_rate,
unit = State,
time = year,
i_unit = "Montana",
i_time = 2016,
generate_placebos = TRUE
) %>%
# Use ridge regression for augmentation
generate_predictor(
time_window = 2008:2013,
avg_uninsured = mean(uninsured_rate, na.rm = TRUE)
) %>%
# In tidysynth, ridge augmentation can be specified in the weights function
generate_weights(
optimization_window = 2008:2015,
ridge_penalty = 0.1 # Add a ridge penalty
) %>%
generate_control()
# Plot the ridge-augmented results
medicaid_out_ridge %>% plot_trends()
medicaid_out_ridge %>% plot_differences()
# Create a dataset specifically comparing the two synthetic controls, ridge and non-aug
# ---------
# Get summaries from both models
syn_sum <- summary(syn)
aug_syn_sum <- summary(aug_syn)
# Create Montana with synthetic controls - IMPORTANT: Subtract ATT not add it
montana_synmontana <-
# data
medicaid_clean %>%
# filter just Montana
filter(State == "Montana") %>%
# Create synthetic Montana values by SUBTRACTING the ATT
# (since negative ATT means uninsured rate decreased)
mutate(
synthetic_montana = uninsured_rate + non_aug_att,
aug_synthetic_montana = uninsured_rate + aug_att
)
# Plot all three lines
montana_synmontana %>%
ggplot() +
# Montana actual
geom_line(aes(x = year,
y = uninsured_rate,
color = 'Montana')) +
# Synthetic Montana (non-augmented)
geom_line(aes(x = year,
y = synthetic_montana,
color = 'Synthetic Montana')) +
# Augmented Synthetic Montana
geom_line(aes(x = year,
y = aug_synthetic_montana,
color = 'Augmented Synthetic Montana')) +
# Use scale color manual to assign color values
scale_color_manual(values = c('Montana' = 'red',
'Synthetic Montana' = 'blue',
'Augmented Synthetic Montana' = 'green')) +
geom_vline(aes(xintercept = 2016)) +
# themes
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
# labels
ggtitle('Montana, Synthetic Montana, and Augmented Synthetic Montana') +
xlab('Year') +
ylab('Uninsured Rate')
Kasey I tried so hard to get synthetic Monatana to show up but I just
could not…. HINT: Is there any preprocessing you need
to do before you allow the program to automatically find weights for
donor states?
What are the advantages and disadvantages of synthetic control compared to difference-in-differences estimators?
Answer:
One of the benefits of synthetic control is that the weights are bounded between [0,1] and the weights must sum to 1. Augmentation might relax this assumption by allowing for negative weights. Does this create an interpretation problem, and how should we balance this consideration against the improvements augmentation offers in terms of imbalance in the pre-treatment period?
Answer:
Do the following:
# multisynth model states
# Prepare data for multisynth
medicaid_multi_ready <- medicaid_expansion_full %>%
# Create binary treatment indicator based on adoption year
mutate(
cbr = ifelse(!is.na(Date_Adopted) & year >= as.numeric(format(Date_Adopted, "%Y")), 1, 0)
)
# Run multisynth with default nu
# ---------
ppool_syn <- multisynth(uninsured_rate ~ cbr,
State, # unit
year, # time
medicaid_multi_ready, # data
n_leads = 6) # post-treatment periods to estimate
# View results
print(ppool_syn$nu)
## [1] 0.2933713
# Get summary stats
# ---------
ppool_syn_summ <- summary(ppool_syn)
# Plot all states (this might be crowded)
# ---------
ppool_syn_summ$att %>%
ggplot(aes(x = Time, y = Estimate, color = Level)) +
geom_point() +
geom_line() +
geom_vline(xintercept = 0) +
theme_minimal() +
theme(axis.title = element_text(),
legend.position = "bottom") +
ggtitle('Synthetic Controls for Medicaid Expansion') +
xlab('Time') +
ylab('Uninsured Rate Estimate')
## Warning: Removed 244 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 244 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Plot with facets for better readability
# ---------
ppool_syn_summ$att %>%
ggplot(aes(x = Time, y = Estimate, color = Level)) +
geom_point() +
geom_line() +
geom_vline(xintercept = 0) +
theme_minimal() +
theme(axis.title = element_text(),
legend.position = 'None') +
ggtitle('Synthetic Controls for Medicaid Expansion') +
xlab('Time') +
ylab('Uninsured Rate Effect') +
facet_wrap(~Level) # facet-wrap by level (state in this case) for clearer presentation
## Warning: Removed 244 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 244 rows containing missing values or values outside the scale range
## (`geom_line()`).
# multisynth model time cohorts
# Run multisynth by time cohorts
# ---------
ppool_syn_time <- multisynth(uninsured_rate ~ cbr,
State,
year,
medicaid_multi_ready,
n_leads = 6,
time_cohort = TRUE)
# Get summary stats
# ---------
ppool_syn_time_summ <- summary(ppool_syn_time)
# Plot cohort effects
# ---------
ppool_syn_time_summ$att %>%
ggplot(aes(x = Time, y = Estimate, color = Level)) +
geom_point() +
geom_line() +
geom_vline(xintercept = 0) +
theme_minimal() +
theme(axis.title = element_text(),
legend.position = "bottom") +
ggtitle('Synthetic Controls for Medicaid Expansion by Adoption Cohort') +
xlab('Time') +
ylab('Uninsured Rate Effect')
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_line()`).
ppool_syn_time_summ$att %>%
ggplot(aes(x = Time, y = Estimate, color = Level)) +
geom_point() +
geom_line() +
geom_vline(xintercept = 0) +
theme_fivethirtyeight() + # Make sure this theme is loaded
theme(axis.title = element_text(),
legend.position = 'None') +
ggtitle('Synthetic Controls for Medicaid Expansion') +
xlab('Time') +
ylab('Uninsured Rate Effect') +
facet_wrap(~Level)
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_line()`).
One feature of Medicaid is that it is jointly administered by the federal government and the states, and states have some flexibility in how they implement Medicaid. For example, during the Trump administration, several states applied for waivers where they could add work requirements to the eligibility standards (i.e. an individual needed to work for 80 hours/month to qualify for Medicaid). Given these differences, do you see evidence for the idea that different states had different treatment effect sizes?
Answer:
Do you see evidence for the idea that early adopters of Medicaid expansion enjoyed a larger decrease in the uninsured population?
Answer:
Why are DiD and synthetic control estimates well suited to studies of aggregated units like cities, states, countries, etc?
Answer:
What role does selection into treatment play in DiD/synthetic control versus regression discontinuity? When would we want to use either method?
Answer: